Problem Description:
As a data scientist at AllLife Bank, you have to build a model that will help the marketing department identify potential customers who have a higher probability of purchasing a loan.
Objectives: predict whether a liability customer will buy a personal loan, identify which variables are most significant, and determine which segment of customers should be targeted more.
# Import libraries
import pandas as pd
import numpy as np
from sklearn import metrics
from sklearn import tree
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
# Libraries to help with data visualization
import matplotlib.pyplot as plt
import seaborn as sns
# Command to tell Python to actually display the graphs
%matplotlib inline
path = "Loan_Modelling.csv"
data = pd.read_csv(path)
data.shape
(5000, 14)
data.head(10)
| | ID | Age | Experience | Income | ZIPCode | Family | CCAvg | Education | Mortgage | Personal_Loan | Securities_Account | CD_Account | Online | CreditCard |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 25 | 1 | 49 | 91107 | 4 | 1.6 | 1 | 0 | 0 | 1 | 0 | 0 | 0 |
| 1 | 2 | 45 | 19 | 34 | 90089 | 3 | 1.5 | 1 | 0 | 0 | 1 | 0 | 0 | 0 |
| 2 | 3 | 39 | 15 | 11 | 94720 | 1 | 1.0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 |
| 3 | 4 | 35 | 9 | 100 | 94112 | 1 | 2.7 | 2 | 0 | 0 | 0 | 0 | 0 | 0 |
| 4 | 5 | 35 | 8 | 45 | 91330 | 4 | 1.0 | 2 | 0 | 0 | 0 | 0 | 0 | 1 |
| 5 | 6 | 37 | 13 | 29 | 92121 | 4 | 0.4 | 2 | 155 | 0 | 0 | 0 | 1 | 0 |
| 6 | 7 | 53 | 27 | 72 | 91711 | 2 | 1.5 | 2 | 0 | 0 | 0 | 0 | 1 | 0 |
| 7 | 8 | 50 | 24 | 22 | 93943 | 1 | 0.3 | 3 | 0 | 0 | 0 | 0 | 0 | 1 |
| 8 | 9 | 35 | 10 | 81 | 90089 | 3 | 0.6 | 2 | 104 | 0 | 0 | 0 | 1 | 0 |
| 9 | 10 | 34 | 9 | 180 | 93023 | 1 | 8.9 | 3 | 0 | 1 | 0 | 0 | 0 | 0 |
#dropping ID column since it shouldn't be used in our predictions
data.drop('ID',axis=1,inplace=True)
data
| | Age | Experience | Income | ZIPCode | Family | CCAvg | Education | Mortgage | Personal_Loan | Securities_Account | CD_Account | Online | CreditCard |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 25 | 1 | 49 | 91107 | 4 | 1.6 | 1 | 0 | 0 | 1 | 0 | 0 | 0 |
| 1 | 45 | 19 | 34 | 90089 | 3 | 1.5 | 1 | 0 | 0 | 1 | 0 | 0 | 0 |
| 2 | 39 | 15 | 11 | 94720 | 1 | 1.0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 |
| 3 | 35 | 9 | 100 | 94112 | 1 | 2.7 | 2 | 0 | 0 | 0 | 0 | 0 | 0 |
| 4 | 35 | 8 | 45 | 91330 | 4 | 1.0 | 2 | 0 | 0 | 0 | 0 | 0 | 1 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 4995 | 29 | 3 | 40 | 92697 | 1 | 1.9 | 3 | 0 | 0 | 0 | 0 | 1 | 0 |
| 4996 | 30 | 4 | 15 | 92037 | 4 | 0.4 | 1 | 85 | 0 | 0 | 0 | 1 | 0 |
| 4997 | 63 | 39 | 24 | 93023 | 2 | 0.3 | 3 | 0 | 0 | 0 | 0 | 0 | 0 |
| 4998 | 65 | 40 | 49 | 90034 | 3 | 0.5 | 2 | 0 | 0 | 0 | 0 | 1 | 0 |
| 4999 | 28 | 4 | 83 | 92612 | 3 | 0.8 | 1 | 0 | 0 | 0 | 0 | 1 | 1 |
5000 rows × 13 columns
data.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5000 entries, 0 to 4999
Data columns (total 13 columns):
 #   Column              Non-Null Count  Dtype
---  ------              --------------  -----
 0   Age                 5000 non-null   int64
 1   Experience          5000 non-null   int64
 2   Income              5000 non-null   int64
 3   ZIPCode             5000 non-null   int64
 4   Family              5000 non-null   int64
 5   CCAvg               5000 non-null   float64
 6   Education           5000 non-null   int64
 7   Mortgage            5000 non-null   int64
 8   Personal_Loan       5000 non-null   int64
 9   Securities_Account  5000 non-null   int64
 10  CD_Account          5000 non-null   int64
 11  Online              5000 non-null   int64
 12  CreditCard          5000 non-null   int64
dtypes: float64(1), int64(12)
memory usage: 507.9 KB
data.describe(include='all').T
| | count | mean | std | min | 25% | 50% | 75% | max |
|---|---|---|---|---|---|---|---|---|
| Age | 5000.0 | 45.338400 | 11.463166 | 23.0 | 35.0 | 45.0 | 55.0 | 67.0 |
| Experience | 5000.0 | 20.104600 | 11.467954 | -3.0 | 10.0 | 20.0 | 30.0 | 43.0 |
| Income | 5000.0 | 73.774200 | 46.033729 | 8.0 | 39.0 | 64.0 | 98.0 | 224.0 |
| ZIPCode | 5000.0 | 93169.257000 | 1759.455086 | 90005.0 | 91911.0 | 93437.0 | 94608.0 | 96651.0 |
| Family | 5000.0 | 2.396400 | 1.147663 | 1.0 | 1.0 | 2.0 | 3.0 | 4.0 |
| CCAvg | 5000.0 | 1.937938 | 1.747659 | 0.0 | 0.7 | 1.5 | 2.5 | 10.0 |
| Education | 5000.0 | 1.881000 | 0.839869 | 1.0 | 1.0 | 2.0 | 3.0 | 3.0 |
| Mortgage | 5000.0 | 56.498800 | 101.713802 | 0.0 | 0.0 | 0.0 | 101.0 | 635.0 |
| Personal_Loan | 5000.0 | 0.096000 | 0.294621 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 |
| Securities_Account | 5000.0 | 0.104400 | 0.305809 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 |
| CD_Account | 5000.0 | 0.060400 | 0.238250 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 |
| Online | 5000.0 | 0.596800 | 0.490589 | 0.0 | 0.0 | 1.0 | 1.0 | 1.0 |
| CreditCard | 5000.0 | 0.294000 | 0.455637 | 0.0 | 0.0 | 0.0 | 1.0 | 1.0 |
We want to see how our independent variables relate to our dependent variable, Personal_Loan. We start by plotting the distribution of each variable.
sns.histplot(data=data, x='Age');
There is no clear pattern here.
sns.histplot(data=data, x='Experience');
sns.histplot(data=data, x='Income');
Income appears to be right skewed
sns.histplot(data=data, x='ZIPCode');
sns.histplot(data=data, x='Family');
sns.histplot(data=data, x='CCAvg');
CCAvg appears to be right skewed
sns.histplot(data=data, x='Education');
The largest group in terms of education is undergraduates
sns.histplot(data=data, x='Mortgage');
Most customers don't have a mortgage.
sns.histplot(data=data, x='Securities_Account');
Most customers don't have a securities account with the bank
sns.histplot(data=data, x='CD_Account');
Most customers don't have a CD account with the bank.
sns.histplot(data=data, x='Online');
Most customers use online banking
sns.histplot(data=data, x='CreditCard');
Most customers (about 71%) have CreditCard = 0, meaning they do not use a credit card issued by a bank other than AllLife Bank.
Some independent variables had interesting results when graphed using a box plot. We are able to see some outliers.
sns.boxplot(data=data, x="Income")
<AxesSubplot:xlabel='Income'>
sns.boxplot(data=data, x="CCAvg")
<AxesSubplot:xlabel='CCAvg'>
Customers whose average monthly credit card spending exceeds roughly 5,000 dollars (CCAvg above about 5.2, which is Q3 + 1.5 × IQR) are outliers. Most customers are at the lower end of credit card expenditures, and there are many outliers.
sns.boxplot(data=data, x="Mortgage")
<AxesSubplot:xlabel='Mortgage'>
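This makes sense because most customers don't have a mortgage (the median is 0). With the distribution concentrated at 0, customers with large mortgages (above roughly 252, which is Q3 + 1.5 × IQR) show up as outliers.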
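The outlier notes above rely on seaborn's default box-plot rule, which flags points more than 1.5 × IQR beyond the quartiles. A minimal sketch, using the quartiles reported by data.describe() earlier and a hypothetical helper named upper_fence (not part of the notebook), shows roughly where that cutoff falls for each variable:

```python
# Quartiles copied from the data.describe() output: (Q1, Q3)
quartiles = {
    "Income": (39.0, 98.0),
    "CCAvg": (0.7, 2.5),
    "Mortgage": (0.0, 101.0),
}

def upper_fence(q1, q3, k=1.5):
    """Upper outlier cutoff Q3 + k * IQR (hypothetical helper, not in the notebook)."""
    return q3 + k * (q3 - q1)

fences = {name: upper_fence(q1, q3) for name, (q1, q3) in quartiles.items()}
for name, fence in fences.items():
    print(f"{name}: values above {fence:.1f} are drawn as outliers")
```

Income and CCAvg are recorded in thousands of dollars according to the notes above, so a CCAvg cutoff near 5.2 corresponds to about 5,200 dollars per month.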
sns.pairplot(data,diag_kind='kde')
<seaborn.axisgrid.PairGrid at 0x7fa3cdf52df0>
# plot_corr is a helper defined further down in this notebook; run that cell before this one
plot_corr(data)
Plotting correlations, we see that Age and Experience are very highly correlated (about 0.99), the strongest correlation in the dataset. Income also has a significant correlation with CCAvg (about 0.65).
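X = data.drop("Personal_Loan" , axis=1)
# pop removes Personal_Loan from data in place, so data no longer contains the target after this line
Y = data.pop("Personal_Loan")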
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=.30, random_state=1)
#check the split
print("{0:0.2f}% data is in training set".format((len(X_train)/len(data.index)) * 100))
print("{0:0.2f}% data is in test set".format((len(X_test)/len(data.index)) * 100))
70.00% data is in training set
30.00% data is in test set
We find the split is 70/30 which is what we were expecting
dTree = DecisionTreeClassifier(criterion = 'gini', random_state=1)
dTree.fit(X_train, Y_train)
DecisionTreeClassifier(random_state=1)
print("Accuracy on training set : ",dTree.score(X_train, Y_train))
print("Accuracy on test set : ",dTree.score(X_test, Y_test))
Accuracy on training set : 1.0
Accuracy on test set : 0.98
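The accuracy on the training set (1.0) and on the test set (0.98) are close, which is what we want. However, a perfect training score means the unpruned tree has memorized the training data, and because only about 10% of customers took the loan, accuracy alone can be misleading.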
#Checking number of positives
Y.sum(axis = 0)
480
480 of the 5,000 customers in the full dataset accepted the personal loan offer. Note that this count covers all of the data, not only the 1,500 customers in the test set.
Consideration: Maybe we don't need such a high degree of accuracy, because the cost of sending out an offer that the customer rejects is really low, whereas a customer who would have accepted but doesn't get an offer costs the bank potential revenue.
480/5000 # share of customers who accepted the loan
0.096
Only about 9.6% of customers accepted the personal loan, so the classes are imbalanced and accuracy can look high even for a model that misses many acceptors.
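As a reference point for the accuracy figures in this notebook, the count from Y.sum() (480 acceptors) can be set against the 5,000 rows of the dataset. The sketch below, which assumes a hypothetical baseline that never predicts a loan, shows how far accuracy gets without finding a single acceptor:

```python
# Counts taken from the notebook: 480 acceptors among 5,000 customers
n_total = 5000
n_positive = 480

# Hypothetical baseline: predict "no loan" for every customer
true_negatives = n_total - n_positive
false_negatives = n_positive
true_positives = 0

baseline_accuracy = true_negatives / n_total
baseline_recall = true_positives / (true_positives + false_negatives)

print(f"Baseline accuracy: {baseline_accuracy:.3f}")  # 0.904
print(f"Baseline recall:   {baseline_recall:.3f}")    # 0.000
```

Any model in this notebook should therefore be judged against roughly 90% accuracy, not 50%, which is one reason to look at recall as well.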
Adding some helper methods to calculate recall and generate a confusion matrix
## Function to create confusion matrix
def make_confusion_matrix(model, y_actual):
    '''
    model : classifier to predict values of X
    y_actual : ground truth
    Uses the global X_test created by the train/test split above.
    '''
    y_predict = model.predict(X_test)
    cm = metrics.confusion_matrix(y_actual, y_predict, labels=[0, 1])
    df_cm = pd.DataFrame(cm, index=["Actual - No", "Actual - Yes"],
                         columns=['Predicted - No', 'Predicted - Yes'])
    # Annotate each cell with its count and its share of all test rows
    group_counts = ["{0:0.0f}".format(value) for value in cm.flatten()]
    group_percentages = ["{0:.2%}".format(value) for value in cm.flatten() / np.sum(cm)]
    labels = [f"{v1}\n{v2}" for v1, v2 in zip(group_counts, group_percentages)]
    labels = np.asarray(labels).reshape(2, 2)
    plt.figure(figsize=(10, 7))
    sns.heatmap(df_cm, annot=labels, fmt='')
    plt.ylabel('True label')
    plt.xlabel('Predicted label')
## Function to calculate recall score
def get_recall_score(model):
    '''
    model : classifier to predict values of X
    '''
    pred_train = model.predict(X_train)
    pred_test = model.predict(X_test)
    print("Recall on training set : ", metrics.recall_score(Y_train, pred_train))
    print("Recall on test set : ", metrics.recall_score(Y_test, pred_test))
Recall is a better metric than accuracy here because it measures how many of the true positives (customers who would accept) we capture, even if that means more false positives. We don't care as much about sending an offer to someone who rejects it: the cost of sending an offer is very low, and the benefit is very high if sending extra offers leads more people to accept.
#generate a confusion matrix
make_confusion_matrix(dTree,Y_test)
# Recall on train and test
get_recall_score(dTree)
Recall on training set : 1.0
Recall on test set : 0.8859060402684564
feature_names = list(X.columns)
print(feature_names)
['Age', 'Experience', 'Income', 'ZIPCode', 'Family', 'CCAvg', 'Education', 'Mortgage', 'Securities_Account', 'CD_Account', 'Online', 'CreditCard']
plt.figure(figsize=(20,30))
tree.plot_tree(dTree,feature_names=feature_names,filled=True,fontsize=9,node_ids=True,class_names=True)
plt.show()
print(tree.export_text(dTree,feature_names=feature_names,show_weights=True))
|--- Income <= 116.50
|   |--- CCAvg <= 2.95
|   |   |--- Income <= 106.50
|   |   |   |--- weights: [2553.00, 0.00] class: 0
|   |   |--- Income > 106.50
|   |   |   |--- Family <= 3.50
|   |   |   |   |--- Education <= 1.50
|   |   |   |   |   |--- weights: [35.00, 0.00] class: 0
|   |   |   |   |--- Education > 1.50
|   |   |   |   |   |--- Age <= 28.50
|   |   |   |   |   |   |--- weights: [0.00, 1.00] class: 1
|   |   |   |   |   |--- Age > 28.50
|   |   |   |   |   |   |--- Age <= 41.50
|   |   |   |   |   |   |   |--- weights: [16.00, 0.00] class: 0
|   |   |   |   |   |   |--- Age > 41.50
|   |   |   |   |   |   |   |--- Age <= 48.50
|   |   |   |   |   |   |   |   |--- weights: [0.00, 2.00] class: 1
|   |   |   |   |   |   |   |--- Age > 48.50
|   |   |   |   |   |   |   |   |--- weights: [12.00, 0.00] class: 0
|   |   |   |--- Family > 3.50
|   |   |   |   |--- Experience <= 3.50
|   |   |   |   |   |--- weights: [10.00, 0.00] class: 0
|   |   |   |   |--- Experience > 3.50
|   |   |   |   |   |--- Age <= 60.00
|   |   |   |   |   |   |--- Experience <= 7.00
|   |   |   |   |   |   |   |--- Age <= 29.50
|   |   |   |   |   |   |   |   |--- weights: [0.00, 1.00] class: 1
|   |   |   |   |   |   |   |--- Age > 29.50
|   |   |   |   |   |   |   |   |--- weights: [2.00, 0.00] class: 0
|   |   |   |   |   |   |--- Experience > 7.00
|   |   |   |   |   |   |   |--- weights: [0.00, 6.00] class: 1
|   |   |   |   |   |--- Age > 60.00
|   |   |   |   |   |   |--- weights: [4.00, 0.00] class: 0
|   |--- CCAvg > 2.95
|   |   |--- Income <= 92.50
|   |   |   |--- CD_Account <= 0.50
|   |   |   |   |--- Age <= 26.50
|   |   |   |   |   |--- weights: [0.00, 1.00] class: 1
|   |   |   |   |--- Age > 26.50
|   |   |   |   |   |--- CCAvg <= 3.55
|   |   |   |   |   |   |--- CCAvg <= 3.35
|   |   |   |   |   |   |   |--- Experience <= 13.00
|   |   |   |   |   |   |   |   |--- Age <= 33.50
|   |   |   |   |   |   |   |   |   |--- weights: [3.00, 0.00] class: 0
|   |   |   |   |   |   |   |   |--- Age > 33.50
|   |   |   |   |   |   |   |   |   |--- weights: [0.00, 1.00] class: 1
|   |   |   |   |   |   |   |--- Experience > 13.00
|   |   |   |   |   |   |   |   |--- Income <= 82.50
|   |   |   |   |   |   |   |   |   |--- weights: [23.00, 0.00] class: 0
|   |   |   |   |   |   |   |   |--- Income > 82.50
|   |   |   |   |   |   |   |   |   |--- Income <= 83.50
|   |   |   |   |   |   |   |   |   |   |--- weights: [0.00, 1.00] class: 1
|   |   |   |   |   |   |   |   |   |--- Income > 83.50
|   |   |   |   |   |   |   |   |   |   |--- weights: [5.00, 0.00] class: 0
|   |   |   |   |   |   |--- CCAvg > 3.35
|   |   |   |   |   |   |   |--- Family <= 3.00
|   |   |   |   |   |   |   |   |--- weights: [0.00, 5.00] class: 1
|   |   |   |   |   |   |   |--- Family > 3.00
|   |   |   |   |   |   |   |   |--- weights: [9.00, 0.00] class: 0
|   |   |   |   |   |--- CCAvg > 3.55
|   |   |   |   |   |   |--- Income <= 81.50
|   |   |   |   |   |   |   |--- weights: [43.00, 0.00] class: 0
|   |   |   |   |   |   |--- Income > 81.50
|   |   |   |   |   |   |   |--- Income <= 83.50
|   |   |   |   |   |   |   |   |--- Age <= 45.50
|   |   |   |   |   |   |   |   |   |--- weights: [7.00, 0.00] class: 0
|   |   |   |   |   |   |   |   |--- Age > 45.50
|   |   |   |   |   |   |   |   |   |--- Age <= 54.50
|   |   |   |   |   |   |   |   |   |   |--- Family <= 3.50
|   |   |   |   |   |   |   |   |   |   |   |--- weights: [0.00, 2.00] class: 1
|   |   |   |   |   |   |   |   |   |   |--- Family > 3.50
|   |   |   |   |   |   |   |   |   |   |   |--- weights: [1.00, 0.00] class: 0
|   |   |   |   |   |   |   |   |   |--- Age > 54.50
|   |   |   |   |   |   |   |   |   |   |--- weights: [2.00, 0.00] class: 0
|   |   |   |   |   |   |   |--- Income > 83.50
|   |   |   |   |   |   |   |   |--- weights: [24.00, 0.00] class: 0
|   |   |   |--- CD_Account > 0.50
|   |   |   |   |--- weights: [0.00, 5.00] class: 1
|   |   |--- Income > 92.50
|   |   |   |--- Education <= 1.50
|   |   |   |   |--- CD_Account <= 0.50
|   |   |   |   |   |--- Family <= 3.50
|   |   |   |   |   |   |--- Online <= 0.50
|   |   |   |   |   |   |   |--- Age <= 55.00
|   |   |   |   |   |   |   |   |--- Family <= 2.50
|   |   |   |   |   |   |   |   |   |--- weights: [12.00, 0.00] class: 0
|   |   |   |   |   |   |   |   |--- Family > 2.50
|   |   |   |   |   |   |   |   |   |--- weights: [0.00, 1.00] class: 1
|   |   |   |   |   |   |   |--- Age > 55.00
|   |   |   |   |   |   |   |   |--- weights: [0.00, 1.00] class: 1
|   |   |   |   |   |   |--- Online > 0.50
|   |   |   |   |   |   |   |--- weights: [20.00, 0.00] class: 0
|   |   |   |   |   |--- Family > 3.50
|   |   |   |   |   |   |--- Experience <= 21.50
|   |   |   |   |   |   |   |--- weights: [0.00, 2.00] class: 1
|   |   |   |   |   |   |--- Experience > 21.50
|   |   |   |   |   |   |   |--- weights: [1.00, 0.00] class: 0
|   |   |   |   |--- CD_Account > 0.50
|   |   |   |   |   |--- Income <= 93.50
|   |   |   |   |   |   |--- weights: [1.00, 0.00] class: 0
|   |   |   |   |   |--- Income > 93.50
|   |   |   |   |   |   |--- weights: [0.00, 5.00] class: 1
|   |   |   |--- Education > 1.50
|   |   |   |   |--- Age <= 63.50
|   |   |   |   |   |--- Mortgage <= 172.00
|   |   |   |   |   |   |--- CD_Account <= 0.50
|   |   |   |   |   |   |   |--- Age <= 60.50
|   |   |   |   |   |   |   |   |--- weights: [0.00, 21.00] class: 1
|   |   |   |   |   |   |   |--- Age > 60.50
|   |   |   |   |   |   |   |   |--- CCAvg <= 3.75
|   |   |   |   |   |   |   |   |   |--- weights: [2.00, 0.00] class: 0
|   |   |   |   |   |   |   |   |--- CCAvg > 3.75
|   |   |   |   |   |   |   |   |   |--- weights: [0.00, 1.00] class: 1
|   |   |   |   |   |   |--- CD_Account > 0.50
|   |   |   |   |   |   |   |--- Family <= 2.00
|   |   |   |   |   |   |   |   |--- weights: [2.00, 0.00] class: 0
|   |   |   |   |   |   |   |--- Family > 2.00
|   |   |   |   |   |   |   |   |--- weights: [0.00, 1.00] class: 1
|   |   |   |   |   |--- Mortgage > 172.00
|   |   |   |   |   |   |--- Income <= 100.00
|   |   |   |   |   |   |   |--- Family <= 2.50
|   |   |   |   |   |   |   |   |--- weights: [5.00, 0.00] class: 0
|   |   |   |   |   |   |   |--- Family > 2.50
|   |   |   |   |   |   |   |   |--- weights: [0.00, 2.00] class: 1
|   |   |   |   |   |   |--- Income > 100.00
|   |   |   |   |   |   |   |--- weights: [0.00, 3.00] class: 1
|   |   |   |   |--- Age > 63.50
|   |   |   |   |   |--- weights: [2.00, 0.00] class: 0
|--- Income > 116.50
|   |--- Education <= 1.50
|   |   |--- Family <= 2.50
|   |   |   |--- weights: [375.00, 0.00] class: 0
|   |   |--- Family > 2.50
|   |   |   |--- weights: [0.00, 47.00] class: 1
|   |--- Education > 1.50
|   |   |--- weights: [0.00, 222.00] class: 1
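Each split in the printed tree is chosen to reduce Gini impurity (the criterion passed when dTree was created). As a minimal sketch, the class counts printed at the leaves are enough to recompute the impurity before and after one split by hand; the helper name gini and the choice of branch are illustrative, not part of the notebook:

```python
def gini(weights):
    """Gini impurity 1 - sum(p_k^2) for a node with the given class counts."""
    total = sum(weights)
    return 1.0 - sum((w / total) ** 2 for w in weights)

# Class counts [class 0, class 1] read from the branch Income > 116.50, Education <= 1.50
left = [375, 0]    # Family <= 2.50
right = [0, 47]    # Family > 2.50
parent = [a + b for a, b in zip(left, right)]  # the node before the Family split

n = sum(parent)
weighted_children = (sum(left) / n) * gini(left) + (sum(right) / n) * gini(right)

print(f"Parent impurity:   {gini(parent):.4f}")
print(f"Children impurity: {weighted_children:.4f}")
print(f"Impurity decrease: {gini(parent) - weighted_children:.4f}")
```

Both children are pure here, so the split on Family removes all of the parent node's impurity. This is consistent with Family ranking among the most important features.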
# What is the importance of each feature in the tree
print (pd.DataFrame(dTree.feature_importances_, columns = ["Imp"], index = X_train.columns).sort_values(by = 'Imp', ascending = False))
                         Imp
Education           0.401465
Income              0.308336
Family              0.169593
CCAvg               0.044408
Age                 0.035708
CD_Account          0.025711
Experience          0.011203
Mortgage            0.003014
Online              0.000561
ZIPCode             0.000000
Securities_Account  0.000000
CreditCard          0.000000
The customers' ZIP codes, and whether they had a securities account or a credit card, had no predictive value (importance 0) for whether they would accept a personal loan offer.
Experience, CD_Account, Age, CCAvg, Mortgage, and Online also had very little importance (each below 0.05).
importances = dTree.feature_importances_
indices = np.argsort(importances)
plt.figure(figsize=(12,12))
plt.title('Feature Importances')
plt.barh(range(len(indices)), importances[indices], color='violet', align='center')
plt.yticks(range(len(indices)), [feature_names[i] for i in indices])
plt.xlabel('Relative Importance')
plt.show()
According to the graph above, Education, Income, and Family are the most important variables. The bank should focus on these when launching its campaigns.
# To reduce the complexity of the tree we will set the max depth limit to 5
dTree1 = DecisionTreeClassifier(criterion = 'gini',max_depth=5,random_state=1)
dTree1.fit(X_train, Y_train)
DecisionTreeClassifier(max_depth=5, random_state=1)
plt.figure(figsize=(15,10))
tree.plot_tree(dTree1,feature_names=feature_names,filled=True,fontsize=9,node_ids=True,class_names=True)
plt.show()
#There is still a high recall on the training set and a slightly lower recall on the test set, which is fine. We reduced overfitting.
get_recall_score(dTree1)
Recall on training set : 0.9516616314199395
Recall on test set : 0.8791946308724832
print (pd.DataFrame(dTree1.feature_importances_, columns = ["Imp"], index = X_train.columns).sort_values(by = 'Imp', ascending = False))
                         Imp
Education           0.438816
Income              0.325524
Family              0.156494
CCAvg               0.041313
CD_Account          0.024794
Experience          0.009097
Age                 0.003963
ZIPCode             0.000000
Mortgage            0.000000
Securities_Account  0.000000
Online              0.000000
CreditCard          0.000000
When reducing the depth of the tree, the importance of Education and Income increases while the importance of family size decreases.
importances = dTree1.feature_importances_
indices = np.argsort(importances)
plt.figure(figsize=(10,10))
plt.title('Feature Importances')
plt.barh(range(len(indices)), importances[indices], color='violet', align='center')
plt.yticks(range(len(indices)), [feature_names[i] for i in indices])
plt.xlabel('Relative Importance')
plt.show()
We can see that Education, Income, and Family remain the most important variables, in the same order, while Age decreased in importance.
Next, we will focus on hyperparameter tuning in an attempt to improve the model.
from sklearn.model_selection import GridSearchCV
# Choose the type of classifier.
estimator = DecisionTreeClassifier(random_state=1)
# Grid of parameters to choose from
parameters = {'max_depth': np.arange(1,10),
              'min_samples_leaf': [1, 2, 5, 7, 10,15,20],
              'max_leaf_nodes' : [2, 3, 5, 10],
              'min_impurity_decrease': [0.001,0.01,0.1]
             }
# Type of scoring used to compare parameter combinations (recall, despite the variable name acc_scorer)
acc_scorer = metrics.make_scorer(metrics.recall_score)
# Run the grid search using a 5 fold cross validation
grid_obj = GridSearchCV(estimator, parameters, scoring=acc_scorer,cv=5)
grid_obj = grid_obj.fit(X_train, Y_train)
# Set the clf to the best combination of parameters
estimator = grid_obj.best_estimator_
# Fit the best algorithm to the data.
estimator.fit(X_train, Y_train)
make_confusion_matrix(estimator,Y_test)
The confusion matrix for the tuned decision tree:
True Positives (TP) - 131
True Negatives (TN) - 1341
False Positives (FP), Type I error - 10
False Negatives (FN), Type II error - 18
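These four counts are enough to reproduce the headline metrics by hand. A minimal sketch in plain Python (the variable names are illustrative):

```python
# Counts reported for the tuned decision tree on the test set
tp, tn, fp, fn = 131, 1341, 10, 18

accuracy = (tp + tn) / (tp + tn + fp + fn)
recall = tp / (tp + fn)       # share of actual acceptors the model finds
precision = tp / (tp + fp)    # share of predicted acceptors who really accept

print(f"accuracy  = {accuracy:.4f}")   # 0.9813
print(f"recall    = {recall:.4f}")     # 0.8792
print(f"precision = {precision:.4f}")  # 0.9291
```

The accuracy and recall computed this way agree with the test-set values the notebook reports for this model.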
#accuracy
print("Accuracy on training set : ",estimator.score(X_train, Y_train))
print("Accuracy on test set : ",estimator.score(X_test, Y_test))
#recall
get_recall_score(estimator)
Accuracy on training set : 0.9897142857142858
Accuracy on test set : 0.9813333333333333
Recall on training set : 0.9274924471299094
Recall on test set : 0.8791946308724832
We are able to achieve a high recall on the test set (about 0.88), with a smaller gap between training and test recall than the unpruned tree had.
plt.figure(figsize=(15,10))
tree.plot_tree(estimator,feature_names=feature_names,filled=True,fontsize=9,node_ids=True,class_names=True)
plt.show()
The tree we create here is simpler than the ones we started with and is easier to explain.
#printing the gini importance of each variable to what we are predicting
print (pd.DataFrame(estimator.feature_importances_, columns = ["Imp"], index = X_train.columns).sort_values(by = 'Imp', ascending = False))
                         Imp
Education           0.447999
Income              0.328713
Family              0.155711
CCAvg               0.042231
CD_Account          0.025345
Age                 0.000000
Experience          0.000000
ZIPCode             0.000000
Mortgage            0.000000
Securities_Account  0.000000
Online              0.000000
CreditCard          0.000000
# analyze reasonable alphas and Gini impurities to seek further gains
clf = DecisionTreeClassifier(random_state=1)
path = clf.cost_complexity_pruning_path(X_train, Y_train)
ccp_alphas, impurities = path.ccp_alphas, path.impurities
pd.DataFrame(path)
| | ccp_alphas | impurities |
|---|---|---|
| 0 | 0.000000 | 0.000000 |
| 1 | 0.000223 | 0.001114 |
| 2 | 0.000268 | 0.002188 |
| 3 | 0.000359 | 0.003263 |
| 4 | 0.000381 | 0.003644 |
| 5 | 0.000381 | 0.004025 |
| 6 | 0.000381 | 0.004406 |
| 7 | 0.000381 | 0.004787 |
| 8 | 0.000409 | 0.006423 |
| 9 | 0.000476 | 0.006900 |
| 10 | 0.000508 | 0.007407 |
| 11 | 0.000582 | 0.007989 |
| 12 | 0.000593 | 0.009175 |
| 13 | 0.000641 | 0.011740 |
| 14 | 0.000769 | 0.014817 |
| 15 | 0.000792 | 0.017985 |
| 16 | 0.001552 | 0.019536 |
| 17 | 0.002333 | 0.021869 |
| 18 | 0.003024 | 0.024893 |
| 19 | 0.003294 | 0.028187 |
| 20 | 0.006473 | 0.034659 |
| 21 | 0.023866 | 0.058525 |
| 22 | 0.056365 | 0.171255 |
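One way to read this table: in minimal cost-complexity pruning, each step collapses the subtree with the smallest impurity increase per removed leaf, so the jump in total impurity between consecutive rows should equal the effective alpha times the net number of leaves removed. The sketch below checks that relationship on the last three rows of the table; the helper name leaves_removed is hypothetical, and the inputs are the rounded values shown above:

```python
# (ccp_alpha, total leaf impurity) rows copied from the pruning-path table
path_rows = [
    (0.006473, 0.034659),  # row 20
    (0.023866, 0.058525),  # row 21
    (0.056365, 0.171255),  # row 22: tree pruned down to the root
]

def leaves_removed(prev_impurity, alpha, impurity):
    """Net number of leaves removed by one pruning step (hypothetical helper).

    Uses: impurity increase = alpha * (leaves in the pruned subtree - 1).
    """
    return round((impurity - prev_impurity) / alpha)

for (_, prev_imp), (alpha, imp) in zip(path_rows, path_rows[1:]):
    print(f"alpha={alpha:.6f}: impurity +{imp - prev_imp:.6f}, "
          f"net leaves removed = {leaves_removed(prev_imp, alpha, imp)}")
```

Larger alphas therefore correspond to pruning steps that give up more training purity per leaf. This is why the pruning code drops the last alpha, which leaves only the root node.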
fig, ax = plt.subplots(figsize=(10,5))
ax.plot(ccp_alphas[:-1], impurities[:-1], marker='o', drawstyle="steps-post")
ax.set_xlabel("effective alpha")
ax.set_ylabel("total impurity of leaves")
ax.set_title("Total Impurity vs effective alpha for training set")
plt.show()
clfs = []
for ccp_alpha in ccp_alphas:
    clf = DecisionTreeClassifier(random_state=1, ccp_alpha=ccp_alpha)
    clf.fit(X_train, Y_train)
    clfs.append(clf)
print("Number of nodes in the last tree is: {} with ccp_alpha: {}".format(
    clfs[-1].tree_.node_count, ccp_alphas[-1]))
Number of nodes in the last tree is: 1 with ccp_alpha: 0.056364969335601575
clfs = clfs[:-1]
ccp_alphas = ccp_alphas[:-1]
node_counts = [clf.tree_.node_count for clf in clfs]
depth = [clf.tree_.max_depth for clf in clfs]
fig, ax = plt.subplots(2, 1,figsize=(10,7))
ax[0].plot(ccp_alphas, node_counts, marker='o', drawstyle="steps-post")
ax[0].set_xlabel("alpha")
ax[0].set_ylabel("number of nodes")
ax[0].set_title("Number of nodes vs alpha")
ax[1].plot(ccp_alphas, depth, marker='o', drawstyle="steps-post")
ax[1].set_xlabel("alpha")
ax[1].set_ylabel("depth of tree")
ax[1].set_title("Depth vs alpha")
fig.tight_layout()
train_scores = [clf.score(X_train, Y_train) for clf in clfs]
test_scores = [clf.score(X_test, Y_test) for clf in clfs]
fig, ax = plt.subplots(figsize=(10,5))
ax.set_xlabel("alpha")
ax.set_ylabel("accuracy")
ax.set_title("Accuracy vs alpha for training and testing sets")
ax.plot(ccp_alphas, train_scores, marker='o', label="train",
drawstyle="steps-post")
ax.plot(ccp_alphas, test_scores, marker='o', label="test",
drawstyle="steps-post")
ax.legend()
plt.show()
index_best_model = np.argmax(test_scores)
best_model = clfs[index_best_model]
print(best_model)
print('Training accuracy of best model: ',best_model.score(X_train, Y_train))
print('Test accuracy of best model: ',best_model.score(X_test, Y_test))
DecisionTreeClassifier(ccp_alpha=0.0006414326414326415, random_state=1)
Training accuracy of best model: 0.9928571428571429
Test accuracy of best model: 0.984
High accuracy but recall is more important
recall_train=[]
for clf in clfs:
    pred_train3=clf.predict(X_train)
    values_train=metrics.recall_score(Y_train,pred_train3)
    recall_train.append(values_train)
recall_test=[]
for clf in clfs:
    pred_test3=clf.predict(X_test)
    values_test=metrics.recall_score(Y_test,pred_test3)
    recall_test.append(values_test)
fig, ax = plt.subplots(figsize=(15,5))
ax.set_xlabel("alpha")
ax.set_ylabel("Recall")
ax.set_title("Recall vs alpha for training and testing sets")
ax.plot(ccp_alphas, recall_train, marker='o', label="train",
drawstyle="steps-post")
ax.plot(ccp_alphas, recall_test, marker='o', label="test",
drawstyle="steps-post")
ax.legend()
plt.show()
# selecting the model with the highest recall on the test set
index_best_model = np.argmax(recall_test)
best_model = clfs[index_best_model]
print(best_model)
DecisionTreeClassifier(ccp_alpha=0.0006414326414326415, random_state=1)
make_confusion_matrix(best_model,Y_test)
# Recall on train and test
get_recall_score(best_model)
Recall on training set : 0.9667673716012085
Recall on test set : 0.9060402684563759
At about 0.906, this is the highest test-set recall among the decision trees created so far. Because the alpha was selected using the test set itself, this estimate is likely somewhat optimistic.
#generate a correlation matrix
data.corr()
| | Age | Experience | Income | ZIPCode | Family | CCAvg | Education | Mortgage | Securities_Account | CD_Account | Online | CreditCard |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Age | 1.000000 | 0.994215 | -0.055269 | -0.030530 | -0.046418 | -0.052012 | 0.041334 | -0.012539 | -0.000436 | 0.008043 | 0.013702 | 0.007681 |
| Experience | 0.994215 | 1.000000 | -0.046574 | -0.030456 | -0.052563 | -0.050077 | 0.013152 | -0.010582 | -0.001232 | 0.010353 | 0.013898 | 0.008967 |
| Income | -0.055269 | -0.046574 | 1.000000 | -0.030709 | -0.157501 | 0.645984 | -0.187524 | 0.206806 | -0.002616 | 0.169738 | 0.014206 | -0.002385 |
| ZIPCode | -0.030530 | -0.030456 | -0.030709 | 1.000000 | 0.027512 | -0.012188 | -0.008266 | 0.003614 | 0.002422 | 0.021671 | 0.028317 | 0.024033 |
| Family | -0.046418 | -0.052563 | -0.157501 | 0.027512 | 1.000000 | -0.109275 | 0.064929 | -0.020445 | 0.019994 | 0.014110 | 0.010354 | 0.011588 |
| CCAvg | -0.052012 | -0.050077 | 0.645984 | -0.012188 | -0.109275 | 1.000000 | -0.136124 | 0.109905 | 0.015086 | 0.136534 | -0.003611 | -0.006689 |
| Education | 0.041334 | 0.013152 | -0.187524 | -0.008266 | 0.064929 | -0.136124 | 1.000000 | -0.033327 | -0.010812 | 0.013934 | -0.015004 | -0.011014 |
| Mortgage | -0.012539 | -0.010582 | 0.206806 | 0.003614 | -0.020445 | 0.109905 | -0.033327 | 1.000000 | -0.005411 | 0.089311 | -0.005995 | -0.007231 |
| Securities_Account | -0.000436 | -0.001232 | -0.002616 | 0.002422 | 0.019994 | 0.015086 | -0.010812 | -0.005411 | 1.000000 | 0.317034 | 0.012627 | -0.015028 |
| CD_Account | 0.008043 | 0.010353 | 0.169738 | 0.021671 | 0.014110 | 0.136534 | 0.013934 | 0.089311 | 0.317034 | 1.000000 | 0.175880 | 0.278644 |
| Online | 0.013702 | 0.013898 | 0.014206 | 0.028317 | 0.010354 | -0.003611 | -0.015004 | -0.005995 | 0.012627 | 0.175880 | 1.000000 | 0.004210 |
| CreditCard | 0.007681 | 0.008967 | -0.002385 | 0.024033 | 0.011588 | -0.006689 | -0.011014 | -0.007231 | -0.015028 | 0.278644 | 0.004210 | 1.000000 |
def plot_corr(df, size=11):
    corr = df.corr()
    fig, ax = plt.subplots(figsize=(size, size))
    ax.matshow(corr)
    plt.xticks(range(len(corr.columns)), corr.columns)
    plt.yticks(range(len(corr.columns)), corr.columns)
    # Write the rounded correlation value into each cell
    for (i, j), z in np.ndenumerate(corr):
        ax.text(j, i, '{:0.1f}'.format(z), ha='center', va='center')
# To get different metric scores
from sklearn.metrics import (
    f1_score,
    accuracy_score,
    recall_score,
    precision_score,
    confusion_matrix,
    roc_auc_score,
    precision_recall_curve,
    roc_curve,
)  # plot_confusion_matrix is omitted: it is unused here and was removed in scikit-learn 1.2
from sklearn import metrics
from sklearn.linear_model import LogisticRegression
# Fit the model on train
model = LogisticRegression(solver="liblinear", random_state=1)
model.fit(X_train, Y_train)
#predict on test
Y_predict = model.predict(X_test)
coef_df = pd.DataFrame(model.coef_)
coef_df['intercept'] = model.intercept_
print(coef_df)
0 1 2 3 4 5 6 \
0 0.001235 -0.00132 0.036132 -0.000067 0.01521 0.009387 0.016434
7 8 9 10 11 intercept
0 0.000833 0.000529 0.004639 -0.000131 -0.000022 -0.000063
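To see how these coefficients turn into a prediction, the sketch below applies the logistic function to one customer. The coefficients are the rounded values printed above, matched to feature_names in column order (an assumption based on how X was built), and the customer is row 9 of the data.head(10) output. Because the coefficients are rounded, the result is only approximate.

```python
import math

feature_names = ['Age', 'Experience', 'Income', 'ZIPCode', 'Family', 'CCAvg',
                 'Education', 'Mortgage', 'Securities_Account', 'CD_Account',
                 'Online', 'CreditCard']
# Rounded coefficients and intercept as printed by print(coef_df)
coefs = [0.001235, -0.00132, 0.036132, -0.000067, 0.01521, 0.009387,
         0.016434, 0.000833, 0.000529, 0.004639, -0.000131, -0.000022]
intercept = -0.000063

# Row 9 of data.head(10): a high-income customer who accepted the loan
customer = [34, 9, 180, 93023, 1, 8.9, 3, 0, 0, 0, 0, 0]

# Contribution of each feature to the log-odds
contributions = {n: c * x for n, c, x in zip(feature_names, coefs, customer)}
log_odds = intercept + sum(contributions.values())
probability = 1.0 / (1.0 + math.exp(-log_odds))

# Show the three largest contributions by absolute size
top = sorted(contributions.items(), key=lambda kv: abs(kv[1]), reverse=True)[:3]
for name, value in top:
    print(f"{name:>10}: {value:+.3f}")
print(f"approximate P(Personal_Loan = 1) = {probability:.2f}")
```

In this example the Income and ZIPCode terms are both large and nearly cancel. That suggests the unscaled ZIPCode column may be acting mostly as an offset and not as a meaningful predictor. Scaling the features or dropping ZIPCode might be worth trying, though the notebook does not test this.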
model_score = model.score(X_test, Y_test)
print(model_score)
#accuracy looks high, but about 90% of test customers did not take the loan, so accuracy alone says little here
0.9073333333333333
cm=metrics.confusion_matrix(Y_test, Y_predict, labels=[1, 0])
df_cm = pd.DataFrame(cm, index = [i for i in ["Actual 1"," Actual 0"]],
columns = [i for i in ["Predict 1","Predict 0"]])
plt.figure(figsize = (7,5))
sns.heatmap(df_cm, annot=True,fmt='g')
plt.show()
The confusion matrix for logistic regression:
True Positives (TP) - 43
True Negatives (TN) - 1318
False Positives (FP), Type I error - 33
False Negatives (FN), Type II error - 106
The confusion matrix for the previous decision tree:
True Positives (TP) - 131
True Negatives (TN) - 1341
False Positives (FP), Type I error - 10
False Negatives (FN), Type II error - 18
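The decision tree had fewer Type I errors (10 vs. 33) and fewer Type II errors (18 vs. 106).
Both models reach high test accuracy (about 0.98 and 0.91), but logistic regression finds only 43 of the 149 actual acceptors (recall of about 0.29), so the decision tree is the better choice when recall matters most.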
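The cost argument made earlier in the notebook (a missed acceptor costs more than a wasted offer) can be made concrete with the error counts above. The per-error costs in this sketch are purely hypothetical placeholders, not figures from the bank, and only the relative comparison is meaningful:

```python
# Test-set error counts reported in the notebook
errors = {
    "decision tree": {"fp": 10, "fn": 18},
    "logistic regression": {"fp": 33, "fn": 106},
}

# HYPOTHETICAL costs, not from the notebook: a wasted offer vs. a missed acceptor
COST_FP = 1.0
COST_FN = 50.0

def campaign_cost(fp, fn, cost_fp=COST_FP, cost_fn=COST_FN):
    """Total misclassification cost under the assumed per-error costs."""
    return fp * cost_fp + fn * cost_fn

costs = {name: campaign_cost(**counts) for name, counts in errors.items()}
for name, cost in costs.items():
    print(f"{name}: {cost:.0f}")
```

Because the decision tree has fewer errors of both types, it comes out cheaper for any positive choice of the two costs. The assumed ratio only changes the size of the gap.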
The business would likely want to target customers with high education levels and high income.
The business can largely ignore the account types of each customer, as these had little effect on whether customers accepted the loan offer in the campaign.